# Chunk options
knitr::opts_chunk$set(
  warning = FALSE,
  message = FALSE,
  cache = TRUE,
  comment = "",
  fig.align = "center"
)

set.seed(27)
# Notes on the chunk options:
# cache = TRUE: saves each chunk's results, so re-knitting skips recomputation
# eval = FALSE: outputs the code only, without running it
# echo = FALSE: hides the code and shows only the results
getwd()

0. Terminology used in this paper

Japanese   English              Supplementary explanation
防衛白書   DEFENSE OF JAPAN     Annual white paper; abbreviation: DOJ
防衛省     Ministry of Defense  2007-, abbreviation: MOD
防衛庁     Defense Agency       1954-2007, ※1

1. Introduction

 In this study, we conduct an exploratory text analysis of the DEFENSE OF JAPAN, the white paper published by the Defense Agency and, later, the Ministry of Defense. The objective is to visualize the contents of the DEFENSE OF JAPAN by means of quantitative text analysis and to analyze changes in Japan’s defense policy objectively. Taking the full texts of all editions published between 1976 and 2021 as the subject of analysis, we examine how policy makers have assessed other countries.

2. About the data to be used

 The subject of this study is the DEFENSE OF JAPAN, published by the Ministry of Defense (the Defense Agency until its promotion to a ministry on January 9, 2007). The first edition was published in 1970; publication resumed with the second edition in 1976 and has continued annually since. Because the data are not continuous, 1970 is excluded, and the 46 editions from 1976 to 2021 are used for the analysis. Hereafter, each edition is referred to by its year, as in “DEFENSE OF JAPAN 2021”.
 All of the Japanese-language editions used here are available on the Ministry of Defense website and can be viewed by anyone. The text was therefore collected by copying and pasting (DEFENSE OF JAPAN 1976, 1998-2001, 2003-2004, 2020-2021) or by scraping with R (DEFENSE OF JAPAN 1977-1997, 2002, 2005-2019); because the page layout and amount of text differed greatly by year of publication, we used whichever method was easier for each year. The analysis covers everything in each white paper that could be collected as text, including the main text and columns, and excludes the reference-material section, images, and figures. However, some sentences that were technically difficult to remove, such as captions explaining images, remain in the data. Depending on the amount of text in a given year, scraping all the descriptions in one year’s white paper into text data took about one to two hours. The R script used for scraping and all of the resulting text data will be published separately.
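As an illustration only, the scraping step can be sketched with the rvest package; the URL and the CSS selector below are placeholders, since the actual page structure of the Ministry of Defense site differs from year to year:

```r
# Minimal scraping sketch (hypothetical URL and selector, not the actual MOD page layout)
library(rvest)

url  <- "https://www.mod.go.jp/"       # placeholder: one section page of a white paper
page <- read_html(url)
text <- page %>%
  html_elements("p") %>%               # the selector must match each year's layout
  html_text2()
writeLines(text, "hakusho/w1976.txt")  # save as plain text for readtext()
```

In practice one such call is needed per page, so a year's worth of sections is collected by looping over that year's table of contents.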

3. Methodology of Text Analysis

 The analysis in this paper uses R, a free software environment for statistical computing. The workflow in R is: (1) preprocessing of the collected texts, (2) creation of a document-feature matrix, and (3) application of statistical analyses (KeyATM, frequent words, word clouds, and co-occurrence networks).
 First, Japanese text is not separated into words by spaces as English is, so it must be segmented into words. The collected DEFENSE OF JAPAN data are first assembled into a set of documents called a “corpus” using the R package quanteda. Next, so that the data can be processed by computer, the text is segmented by “tokenization”. At this stage we delete symbols, particles, auxiliary verbs, and words indicating dates or eras (“same year,” “present,” “Heisei,” etc.), as well as words unnecessary for the analysis of defense white papers (“see,” “chart,” “item,” etc.), and join words that should form a single word but were split.
 Tokenization in this study is also performed with quanteda. In the past, morphological analysis tools (e.g., RMeCab) were often used for Japanese preprocessing, but quanteda can tokenize Japanese without external tools, and the results differ little from those of morphological analysis. The next step is to build a “document-feature matrix” from the tokens produced for each document. In this matrix, each row corresponds to a document and each column to a word, and the value of a cell is the frequency with which that word appears in that document.
 The document-feature matrix created in this way can then be used for statistical analysis. In this study, we conducted KeyATM, frequent-word, word-cloud, and co-occurrence-network analyses in order to quantify the descriptions in the DEFENSE OF JAPAN and their changes over time.

# Load packages
library("tidyverse")
library("readtext")
library("quanteda")
library("stm")
library("keyATM")
library("quanteda.textstats")
library("quanteda.textplots")
library("seededlda")
# Set the working directory
setwd("/mnt/c/users/shimi/sotsuron")

4. Text preprocessing

4-1. Reading text data

modtext <- readtext(file = "./hakusho/*.txt",
                    encoding = "utf-8") %>%
  mutate(year = str_extract(doc_id, "\\d\\d\\d\\d") %>%
           as.numeric())
mod1976 <- readtext(file = "hakusho/w1976.txt",
                    encoding = "utf-8")

4-2. Creating a corpus

corp <- corpus(modtext, docid_field = "doc_id")
# A corpus is created from a data.frame or a character vector and stores the documents and document variables in their original form.
corp1976 <- corpus(mod1976, docid_field = "doc_id")
summary(corp)

4-3. Tokenization

toks<-tokens(corp, remove_numbers = TRUE,  remove_punct = TRUE, 
             remove_symbols = TRUE) %>% 
  tokens_remove(c(stopwords("ja", source = "marimo"),
                 "章","第","の","部","と","て","は","に","を","における",
                  "が","し","ば","で","など", "こと","って","なる","において",
                  "できる","進む","ページ","れる","とともに","向け","とる","あっ",
                  "てい","られる","もの","させる","かつ","させる","及び","行う",
                  "I","II","III","受け","んで","戻る","のみ","なら","項目","白書",
                  "三つ","参照","含む","に際して","ます","にし","とも","実施","取り組み",
                  "同年","指摘","わが国","図表","利用","活用","はじめ","両国","重要","使用",
                  "関係","可能","つつ","政府","推進","計画","現在","きた","以上",
                  "昭和","場合","という","各種","とおり","われ","今後","開催",
                  "必要","結果","場合","新た","引き続き","行い","提供","なお",
                  "といった","われる","および","資料","近年","行い","強化","見直し","なく",
                  "又は","目的","以降","同年","各種","見直し","図る","具体","努めて",
                 "平成","令和","昭和","活動"
  )) %>% 
  tokens_select(min_nchar=2)
# A tokens object is created from a corpus and stores each text split into words;
# because tokens preserve word positions, compound words can be selected, removed, or joined.
# Remove unneeded words such as hiragana-only tokens
# Tokenization of the 1976 edition
toks1976<-tokens(corp1976, remove_numbers = TRUE,  remove_punct = TRUE, 
             remove_symbols = TRUE) %>% 
  tokens_remove(c(stopwords("ja", source = "marimo"),
                 "章","第","の","部","と","て","は","に","を","における",
                  "が","し","ば","で","など", "こと","って","なる","において",
                  "できる","進む","ページ","れる","とともに","向け","とる","あっ",
                  "てい","られる","もの","させる","かつ","させる","及び","行う",
                  "I","II","III","受け","んで","戻る","のみ","なら","項目","白書",
                  "三つ","参照","含む","に際して","ます","にし","とも","実施","取り組み",
                  "同年","指摘","わが国","図表","利用","活用","はじめ","両国","重要","使用",
                  "関係","可能","つつ","政府","推進","計画","現在","きた","以上",
                  "昭和","場合","という","各種","とおり","われ","今後","開催",
                  "必要","結果","場合","新た","引き続き","行い","提供","なお",
                  "といった","われる","および","資料","近年","行い","強化","見直し","なく",
                  "又は","目的","以降","同年","各種","見直し","図る","具体","努めて",
                 "平成","令和","昭和","活動","the","たとえば"
  )) %>% 
  tokens_select(min_nchar=2)
# Identify collocations appearing at least 250 times
col <- toks %>% 
  textstat_collocations(min_count =250)
# Identify collocations appearing at least 10 times
col1976 <- toks1976 %>% 
  textstat_collocations(min_count =10)
col$collocation
col1976$collocation
col$z
col1976$z
toks_comp <- tokens_compound(toks, col[col$z > 26], concatenator = "") %>% 
  tokens_keep(min_nchar = 2)
toks_comp1976 <- tokens_compound(toks1976, col1976[col1976$z > 26], concatenator = "") %>% 
  tokens_keep(min_nchar = 2)
# Join compounds that were not formed correctly
toks_comp<- tokens_compound(toks_comp, pattern = c("フリゲート","人民解放軍",
                                                   "日米同盟","ミサイル防衛","海上輸送",
                                                   "共同演習","ハイレベル交流","ASEAN諸国",
                                                   "安全保障","海上自衛隊","災害派遣",
                                                   "西側諸国","陸上自衛隊","国際情勢",
                                                   "共同訓練","航空自衛隊","軍事情勢",
                                                   "輸送能力","安全保障"))
toks_comp1976<- tokens_compound(toks_comp1976, pattern = c("海上自衛隊","災害派遣","西側諸国",
                                                   "陸上自衛隊","国際情勢","共同訓練",
                                                   "航空自衛隊","軍事情勢","輸送能力",
                                                   "レーダーサイト",
                                                   "安全保障"))

4-4. Create a document-feature matrix

dfm <- dfm(toks_comp) 
dfm <- dfm %>% 
  dfm_trim(min_termfreq = 1500)%>% 
  dfm_remove(pattern = c("防衛", "自衛隊"))
dfm1976 <- dfm(toks_comp1976) 

5. Analysis Results

5-1. KeyATM

 KeyATM (Keyword Assisted Topic Models)2 is a “quasi-supervised” statistical model in which a machine learns word relationships from a corpus and classifies documents using a human-supplied vocabulary as a cue. The method was developed to compensate for the weaknesses of both supervised and unsupervised models, and the choice of keywords is therefore critically important.3
 In this study, we trace, along a timeline, the extent to which the countries that have attracted attention in defense policy in recent years were attended to in the past. For this purpose, the countries mentioned in Part I, “The Security Environment Surrounding Japan,” of the digest of the DEFENSE OF JAPAN for the most recent five years (2017-2021) are included in the analysis. The digest conveys what each year’s white paper most wants to communicate, so the countries mentioned there can be regarded as the ones considered most important for Japan’s security. Table 1 lists the countries mentioned in the DEFENSE OF JAPAN 2017-2021, in order of first mention. Items naming regions, such as “Middle East and North Africa” and “Europe,” are excluded because they carry less weight than detailed mentions of a single country. In addition, “Russia” was written as “Soviet Union” until the end of the Cold War. Taking these points into account, the keywords are set to “United States”, “North Korea”, “China”, “Russia”, and “Soviet Union”.
 Figure 1 shows the proportion of each country’s topic in the full text data of the DEFENSE OF JAPAN 1976-2021, without considering the time axis. Figure 2 shows the change in each country’s topic proportion over time, and Figure 3 combines the per-country panels of Figure 2 into a single figure. Figure 4 shows the topic proportions for the non-democracies only, that is, Figure 3 with the United States excluded.

Table 1: Countries mentioned in the DEFENSE OF JAPAN 2017-2021 overview
Year  1              2            3            4
2017  United States  North Korea  China        Russia
2018  United States  North Korea  China        Russia
2019  United States  China        North Korea  Russia
2020  United States  China        North Korea  Russia
2021  United States  China        North Korea  Russia
keyatm_docs <- keyATM_read(texts = dfm)
summary(keyatm_docs)
# Create the keyword list
keywords <- list(
  米国    = c("米国"),
  北朝鮮  =c("北朝鮮"),
  中国 = c("中国"),
  ロシア = c("ロシア"),
  ソ連 = c("ソ連"))
# Proportion of each topic in the white papers as a whole
visualize_keywords(docs = keyatm_docs,
                   keywords = keywords)
Figure 1: Percentage of each country’s topics in the overall DEFENSE OF JAPAN
# Add the time index
time <- modtext %>%
  mutate(year = str_extract(doc_id, "\\d\\d\\d\\d") %>%
           as.numeric()) %>%
  filter(year >= 1976) %>%
  mutate(time = year - 1975) %>% 
  select(doc_id, time)
time
time$time
out <- keyATM(docs              = keyatm_docs,                         
              no_keyword_topics = 10,  # number of topics to estimate in addition to the keyword topics
              keywords          = keywords,                      
              model             = "dynamic",                           
              model_settings    = list(time_index =time$time,
                                       num_states = length(time$time)),                
              options           = list(seed = 250,
                                       parallel_init = TRUE,
                                       store_theta = TRUE))
top_words(out, 30)
plot_timetrend(out, time = time$time + 1975)
Figure 2: Changes in the percentage of topics on each country over time
theta_ls <- out$values_iter$theta_iter
# Code to combine the panels of Figure 2 into a single figure
theta_1 <- theta_2 <- theta_3 <- theta_4 <- theta_5 <- matrix(NA, length(time$time), length(theta_ls))
for (iter in 1:length(theta_ls)) {
  theta_temp <- theta_ls[[iter]]
  theta_1[, iter] <- theta_temp[, 1]
  theta_2[, iter] <- theta_temp[, 2]
  theta_3[, iter] <- theta_temp[, 3]
  theta_4[, iter] <- theta_temp[, 4]
  theta_5[, iter] <- theta_temp[, 5]
}
lwr_fn <- function(x) quantile(x, probs = 0.025)
upr_fn <- function(x) quantile(x, probs = 0.975)
us <- tibble(lwr = apply(theta_1, 1, lwr_fn),
                upr = apply(theta_1, 1, upr_fn),
                mean = apply(theta_1, 1, mean),
                type = "米国",
                year = 1976:2021)

nkorea <- tibble(lwr = apply(theta_2, 1, lwr_fn),
             upr = apply(theta_2, 1, upr_fn),
             mean = apply(theta_2, 1, mean),
             type = "北朝鮮",
             year = 1976:2021)

china <- tibble(lwr = apply(theta_3, 1, lwr_fn),
                 upr = apply(theta_3, 1, upr_fn),
                 mean = apply(theta_3, 1, mean),
                 type = "中国",
                 year = 1976:2021)

russia <- tibble(lwr = apply(theta_4, 1, lwr_fn),
               upr = apply(theta_4, 1, upr_fn),
               mean = apply(theta_4, 1, mean),
               type = "ロシア",
               year = 1976:2021)

USSR <- tibble(lwr = apply(theta_5, 1, lwr_fn),
                 upr = apply(theta_5, 1, upr_fn),
                 mean = apply(theta_5, 1, mean),
                 type = "ソ連",
                 year = 1976:2021)
# United States, China, North Korea, Soviet Union, Russia

bind_rows(USSR,russia,china,nkorea,us) %>% 
  ggplot(aes(x = year, y = mean, color = type)) +
  geom_line() +
  geom_point(aes(shape = type)) +
  theme_bw() +
  xlab("Year") + ylab("Estimated theta") +
  scale_color_hue() +
  theme(legend.title = element_blank())
Figure 3: Changes in the percentage of topics on each country
Figure 4: Changes in the percentage of topics on non-democratic countries

5-2. Frequency count and Word Cloud

 The DEFENSE OF JAPAN 2021 is structured as follows: Part I, Security Environment Surrounding Japan; Part II, Japan’s Security and Defense Policy; Part III, Three Pillars of Japan’s Defense (Means to Achieve the Objectives of Defense); and Part IV, Core Elements Comprising Defense Capability. Although the detailed structure changes every year, there is always some description of the military situation and of Japan’s defense policy, so counting word frequencies and comparing the results over time can clarify changes in what was emphasized. We therefore tabulated word frequencies for the white papers at five-year intervals from 1976 onward. The top 20 most frequent words (nouns only) are visualized as bar graphs in 5-2-1. The same frequency tables are also visualized in 5-2-2 as word clouds, in which more frequent words are drawn in a larger font.

5-2-1. Frequency count

1976.

dfm1976<-dfm(toks_comp1976,remove="") %>% 
  dfm_remove("^[ぁ-ん]+$", valuetype = "regex", min_nchar = 2)
topwords_title1976 <- topfeatures(dfm1976, 30)
# Create the sorted word list

sorted.freq.list1976 <- sort(topwords_title1976, decreasing = TRUE)

barplot(sorted.freq.list1976[1 : 20], 
        las = 2, 
        cex.names = 1.0,
        xlab = "単語",
        ylab = "頻度")
20 frequently used words in 1976

1981.

20 frequently used words in 1981

1986.

20 frequently used words in 1986

1991.

20 frequently used words in 1991

1996.

20 frequently used words in 1996

2001.

20 frequently used words in 2001

2006.

20 frequently used words in 2006

2011.

20 frequently used words in 2011

2016.

20 frequently used words in 2016

2021.

20 frequently used words in 2021
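The 1981-2021 bar graphs were produced by repeating the 1976 code with each year's data. As a sketch, that repetition could be written as a loop, assuming tokens objects named toks_comp1981, toks_comp1986, and so on have been created in the same way as toks_comp1976:

```r
# Sketch: repeat the frequency bar plot for each five-year edition
# (assumes per-year tokens objects toks_comp1976, toks_comp1981, ... exist)
for (yr in seq(1976, 2021, by = 5)) {
  toks_yr <- get(paste0("toks_comp", yr))
  dfm_yr  <- dfm(toks_yr)
  freq    <- sort(topfeatures(dfm_yr, 20), decreasing = TRUE)
  barplot(freq, las = 2, cex.names = 1.0,
          main = paste("Top 20 words in", yr))
}
```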

5-2-2. Word Cloud

1976.

Word Cloud in 1976

1981.

Word Cloud in 1981

1986.

Word Cloud in 1986

1991.

Word Cloud in 1991

1996.

Word Cloud in 1996

2001.

Word Cloud in 2001

2006.

Word Cloud in 2006

2011.

Word Cloud in 2011

2016.

Word Cloud in 2016

2021.

Word Cloud in 2021
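The word clouds were generated from each year's document-feature matrix. A minimal sketch using textplot_wordcloud() from quanteda.textplots, shown here for 1976 (max_words and min_count are illustrative settings, not necessarily those used for the figures above):

```r
# Sketch: word cloud from the 1976 document-feature matrix
library(quanteda.textplots)
textplot_wordcloud(dfm1976,
                   max_words = 100,  # draw at most 100 words
                   min_count = 10)   # drop words appearing fewer than 10 times
```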

5-3. Co-occurrence Network

 To clarify the contexts in which these words are used, we conducted a co-occurrence analysis. Co-occurring words are words that often appear together near the words under analysis, and they can reveal how those words are described and evaluated; applied to the DEFENSE OF JAPAN, this can reveal the contexts in which the frequent words are used. For each white paper at five-year intervals from 1976 onward, we selected the 30 most characteristic words and set the co-occurrence threshold (the min_freq argument of textplot_network) to 0.85. The results are shown below. The size of each black circle indicates the frequency of the word, and the thickness of a line indicates the strength of the relationship between co-occurring words.

1976.

dfm1976<-dfm(toks_comp1976,remove="") %>% 
  dfm_remove("^[ぁ-ん]+$", valuetype = "regex", min_nchar = 2)
topwords_title1976 <- topfeatures(dfm1976, 30)
fcm1976 <- dfm1976 %>%
  fcm()
fcm_1976_2 <- fcm1976 %>% 
  fcm_select(pattern = names(topwords_title1976)) %>% 
  fcm_remove(pattern = c("平成", "令和"))
size <- rowSums(fcm_1976_2) %>% sqrt
fcm_1976_2 %>% 
  textplot_network(
                   min_freq = 0.85,
                   vertex_size = size / max(size)*3,
                   edge_alpha = 0.7,
                   edge_size = 0.7,
                   vertex_labelsize = 5)
Co-occurrence Network in 1976

1981.

Co-occurrence Network in 1981

1986.

Co-occurrence Network in 1986

1991.

Co-occurrence Network in 1991

1996.

Co-occurrence Network in 1996

2001.

Co-occurrence Network in 2001

2006.

Co-occurrence Network in 2006

2011.

Co-occurrence Network in 2011

2016.

Co-occurrence Network in 2016

2021.

Co-occurrence Network in 2021
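The 1981-2021 networks repeat the 1976 code with each year's data. As a sketch, that code could be wrapped in a reusable function, assuming per-year tokens objects such as toks_comp1981 exist; the defaults mirror the settings used above:

```r
# Sketch: reusable co-occurrence network plot for one year's tokens
plot_cooccurrence <- function(toks_yr, top_n = 30, threshold = 0.85) {
  dfm_yr <- dfm(toks_yr)
  top    <- names(topfeatures(dfm_yr, top_n))
  fcm_yr <- fcm(dfm_yr) %>% fcm_select(pattern = top)
  size   <- sqrt(rowSums(fcm_yr))
  textplot_network(fcm_yr,
                   min_freq = threshold,
                   vertex_size = size / max(size) * 3,
                   edge_alpha = 0.7, edge_size = 0.7,
                   vertex_labelsize = 5)
}
plot_cooccurrence(toks_comp1976)  # same settings as the 1976 figure
```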

  1. English names of the new central government organizations (1 Cabinet Office and 12 ministries and agencies) https://www.kantei.go.jp/jp/cyuo-syocho/name-e.html (viewed on Feb. 27, 2022)↩︎

  2. keyATM https://keyatm.github.io/keyATM/index.html (viewed on Jan. 29, 2022)↩︎

  3. Catalinac, Amy and Kohei Watanabe (2019) “Quantitative Text Analysis in Japanese,” Bulletin of the Waseda Institute for Advanced Study, Vol. 11, pp. 133-143↩︎